Skip to contents

Welcome!

Thank you for your interest in Tools for Automated Data Analysis (TADA). TADA is an open-source tool set built in the R programming language and available for anyone to download and edit to their specific needs. This RMarkdown document walks users through how to download the TADA R package from GitHub, access and parameterize several important functions with a sample dataset, and create basic visualizations. The workflow is similar to a funnel: at each decision point, data that fail QC checks are removed from the core dataset and placed in a separate dataframe, while data that pass are carried into the next step. At the end of the QC checks, the user should be confident that their data are properly documented and applicable to the analysis at hand.

Note: TADA is still under development. New functionality is added weekly, and sometimes we need to make bug fixes in response to tester and user feedback. We appreciate your feedback, patience, and interest in these helpful tools.

Customize or contribute

TADA is housed in a repository on GitHub. Users desiring to review the base code and customize the package for their own purposes may:

  • Clone the repository using Git

  • Open the repository using GitHub Desktop, or

  • Download a zip file of the repository to their desktop.

Interested in contributing to the TADA package? The TADA team highly encourages input and development from users. Check out the Contributing page on the TADA GitHub site for guidance on collaboration conventions.

Install and setup

Users can install the TADA package from GitHub into their R library using the remotes package. Copy and paste the code below into your R or RStudio console to download and install.

install.packages("remotes",
  repos = "http://cran.us.r-project.org"
)
library(remotes)

TADA package relies on other packages, therefore you may be prompted in the console to update dependency packages that have more recent versions available. If you see this prompt, it is recommended to update all of them (enter 1 into the console).

remotes::install_github("USEPA/TADA",
  ref = "develop",
  dependencies = TRUE
)

It’s that easy! The most stable branch for TADA right now is the develop branch. Contributors generally create their own branches based on develop, make some improvements, and then submit a pull request to be reviewed by the TADA Team. Once approved, updates are then merged into the develop branch. However, you are welcome to download any branch you’d like using the ref input in install_github (see code chunk above). This functionality is mainly only useful to TADA package developers/contributors.

The following code block ensures the additional packages needed to run the code in this RMarkdown document are loaded. However, users may also use the package name:: package function notation to avoid the list of library() calls.

list.of.packages <- c("tidyverse")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)

library(tidyverse)

Help pages

All TADA R package functions have their own individual help pages, listed on the Function reference page on the GitHub site. Users can also access the help page for a given function in R or RStudio using the following format (example below): ?TADA::[name of TADA function]

Upload data

Now let’s start using the TADA R package functions. The first step is to bring a dataset into the R environment. TADA is designed to work with Water Quality Portal (WQP) data. This means that all of its functions will look for WQP column names and create new TADA-specific columns based on these elements. Users may upload their own custom dataset into R for use with TADA by ensuring their column names and data formats (e.g. numeric, character) align with WQP profiles.

If you are interested in reviewing the column headers and formats required to run TADA, use the function below, which saves an example spreadsheet to the user’s working directory. You can also take a look at an example dataset, like TADA::Data_Nutrients_UT to get an idea of the data structure and format.

getwd() # find your working directory
## [1] "/home/runner/work/TADA/TADA/vignettes"
template <- TADA::TADA_GetTemplate() # download template to working directory

# uncomment below to review example dataset
# Data_Nutrients_UT <- TADA::Data_Nutrients_UT

TADA_DataRetrieval is built upon USGS’s readWQPdata function within the dataRetrieval package, which uses web service calls to bring WQP data into the R environment. Additionally, TADA_DataRetrieval performs some basic quality control checks via TADA_AutoClean on the data using new TADA-specific columns to preserve the original dataset:

  • Converts key character columns to ALL CAPS for easier harmonization and validation.

  • Removes complete duplicate results.

  • Identifies different classes of result values (numeric, text, percentage, comma-separated numeric, greater than/less than, numbers preceded by a tilde, etc.) and converts values to numeric where feasible.

  • Unifies result and depth units to common units to improve ease of data harmonization. See ?TADA_ConvertResultUnits and ?TADA_ConvertDepthUnits for more information on these processes. These functions can also be run separately if the user wishes to convert result or depth values to different units.

Let’s give it a try. Setting applyautoclean to TRUE in TADA_DataRetrieval means that the basic quality control steps described above are run. TADA_DataRetrieval follows similar parameterization to readWQPdata, but check out the help page or enter ?TADA::TADA_DataRetrieval into the console for more information about input parameters and to see several examples.

# download example data 
# dataset_0  <- TADA_DataRetrieval(
#   organization = c("REDLAKE_WQX", "SFNOES_WQX", "PUEBLO_POJOAQUE", "FONDULAC_WQX",   "PUEBLOOFTESUQUE", "CNENVSER"), 
#   startDate = "2018-01-01", 
#   endDate = "2019-01-01")

# For brevity, we'll skip pinging the WQP and instead load the example dataset:
dataset_0 <- TADA::Data_6Tribes_5y

# Let's take a look at all of the TADA-created columns:
names(dataset_0)[grepl("TADA.", names(dataset_0))]
##  [1] "TADA.ActivityMediaName"                                           
##  [2] "TADA.ResultSampleFractionText"                                    
##  [3] "TADA.CharacteristicName"                                          
##  [4] "TADA.MethodSpeciationName"                                        
##  [5] "TADA.ComparableDataIdentifier"                                    
##  [6] "TADA.ResultMeasureValue"                                          
##  [7] "TADA.ResultMeasureValueDataTypes.Flag"                            
##  [8] "TADA.ResultMeasure.MeasureUnitCode"                               
##  [9] "TADA.WQXResultUnitConversion"                                     
## [10] "TADA.DetectionQuantitationLimitMeasure.MeasureValue"              
## [11] "TADA.DetectionQuantitationLimitMeasure.MeasureValueDataTypes.Flag"
## [12] "TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode"           
## [13] "TADA.ResultDepthHeightMeasure.MeasureValue"                       
## [14] "TADA.ResultDepthHeightMeasure.MeasureValueDataTypes.Flag"         
## [15] "TADA.ResultDepthHeightMeasure.MeasureUnitCode"                    
## [16] "TADA.ActivityDepthHeightMeasure.MeasureValue"                     
## [17] "TADA.ActivityDepthHeightMeasure.MeasureValueDataTypes.Flag"       
## [18] "TADA.ActivityDepthHeightMeasure.MeasureUnitCode"                  
## [19] "TADA.ActivityTopDepthHeightMeasure.MeasureValue"                  
## [20] "TADA.ActivityTopDepthHeightMeasure.MeasureValueDataTypes.Flag"    
## [21] "TADA.ActivityTopDepthHeightMeasure.MeasureUnitCode"               
## [22] "TADA.ActivityBottomDepthHeightMeasure.MeasureValue"               
## [23] "TADA.ActivityBottomDepthHeightMeasure.MeasureValueDataTypes.Flag" 
## [24] "TADA.ActivityBottomDepthHeightMeasure.MeasureUnitCode"            
## [25] "TADA.LatitudeMeasure"                                             
## [26] "TADA.LongitudeMeasure"

Currently, the TADA_DataRetrieval function combines three WQP data profiles: Sample Results (Physical/Chemical), Site data, and Project data. This ensures that all important quality control columns are included in the dataset.

Note: USGS and EPA are working together to create WQP 3.0 data profiles. Once released, one data profile will contain the columns critical to TADA, removing the need to combine profiles in this first step. TADA package users likely will not notice a difference in their usage of the TADA_DataRetrieval function, but it will simplify the steps needed to upload a custom or WQP GUI-downloaded dataset into the R package.

Initial data review

Now that we’ve pulled the data into the R session, let’s take a look at it. Note that any column names with the “TADA.” prefix were generated from the TADA_DataRetrieval function.

First, always good to take a look at the data frame dimensions.

Question 1: What are the dimensions of your dataset?

dim(dataset_0) # returns x and of x (as the numbers of rows and columns respectively)
## [1] 153054    149

Before we start filtering and flagging our data, let’s create a function (dimCheck) that performs dimension checks between the results that pass each filter or QC flag check (and are retained) and those that do not (and are removed). These dimension checks ensure that the total number of rows in the original input dataset (all_result_num) equal the the total number of rows added up between the passing (pass_data) and removed (fail_data) data frames.

# defining a dimension check function that compares removed and retained data dimensions against the initial data input
dimCheck <- function(all_result_num, pass_data, fail_data, checkName) {
  # check result numbers after split
  final_result_num <- dim(pass_data)[1] + dim(fail_data)[1]

  # always good to do a dimension check
  if (!all_result_num == final_result_num) {
    print(paste0("Help! Results do not add up between dataset and removed after ", checkName, " check."))
  } else {
    print(paste0("Good to go. Zero results created or destroyed in ", checkName, " check."))
  }
}

# let's first get the total number of rows in the dataset.
all_result_num <- dim(dataset_0)[1]

Next, we can use the TADA TADA_FieldCounts() function to see how many unique values are contained within each column of the dataset. The function can either return all column counts, most, or just the key columns. We’ll try the input with display = "key" and display = "all". Enter ?TADA::TADA_FieldCounts() into the console for more information on this function.

Question 2: Which column should have a unique value in every row and why?

key_counts <- TADA::TADA_FieldCounts(dataset_0, display = "key")

key_counts
##                                Fields Count
## 1                 ActivityCommentText   898
## 2             TADA.CharacteristicName   139
## 3                         ProjectName    32
## 4                      LaboratoryName     9
## 5                    ActivityTypeCode     7
## 6              OrganizationFormalName     6
## 7  DetectionQuantitationLimitTypeName     6
## 8          MonitoringLocationTypeName     5
## 9                           StateCode     4
## 10             TADA.ActivityMediaName     3
## 11       ActivityMediaSubdivisionName     2
all_counts <- TADA::TADA_FieldCounts(dataset_0, display = "all")

all_counts
##                                                                Fields  Count
## 1                                                    ResultIdentifier 153054
## 2                                             TADA.ResultMeasureValue  41055
## 3                                                  ResultMeasureValue  37271
## 4                                                  ActivityIdentifier  21089
## 5                                               ActivityStartDateTime  13376
## 6                                 ResultDetectionQuantitationLimitUrl  11924
## 7                                                         LastUpdated   6989
## 8                          TADA.ResultDepthHeightMeasure.MeasureValue   3938
## 9                               ResultDepthHeightMeasure.MeasureValue   3848
## 10                                             ActivityStartTime.Time   3757
## 11                                                ActivityEndDateTime   1026
## 12                                               ActivityEndTime.Time   1001
## 13                                                ActivityCommentText    898
## 14                                                  ActivityStartDate    854
## 15                                                  AnalysisStartDate    545
## 16                     DetectionQuantitationLimitMeasure.MeasureValue    382
## 17                TADA.DetectionQuantitationLimitMeasure.MeasureValue    381
## 18                                                  ResultCommentText    381
## 19                                               SubjectTaxonomicName    285
## 20                                  ActivityLocation.LongitudeMeasure    279
## 21                                   ActivityLocation.LatitudeMeasure    276
## 22                                       MonitoringLocationIdentifier    235
## 23                                             MonitoringLocationName    230
## 24                                                   LongitudeMeasure    225
## 25                                              TADA.LongitudeMeasure    225
## 26                                                    LatitudeMeasure    222
## 27                                               TADA.LatitudeMeasure    221
## 28                                      TADA.ComparableDataIdentifier    208
## 29                            ActivityDepthHeightMeasure.MeasureValue    166
## 30                       TADA.ActivityDepthHeightMeasure.MeasureValue    163
## 31                                                 CharacteristicName    140
## 32                                            TADA.CharacteristicName    139
## 33                              DataQuality.UpperConfidenceLimitValue     95
## 34                            ResultAnalyticalMethod.MethodIdentifier     70
## 35                                  ResultAnalyticalMethod.MethodName     68
## 36                                                    ActivityEndDate     68
## 37                              DataQuality.LowerConfidenceLimitValue     64
## 38                      ActivityBottomDepthHeightMeasure.MeasureValue     55
## 39                 TADA.ActivityBottomDepthHeightMeasure.MeasureValue     54
## 40                       ResultAnalyticalMethod.MethodDescriptionText     40
## 41                                      ResultMeasure.MeasureUnitCode     35
## 42                                                        ProjectName     32
## 43                                                  ProjectIdentifier     32
## 44                                 TADA.ResultMeasure.MeasureUnitCode     22
## 45                                             ProjectDescriptionText     19
## 46                                  SampleCollectionMethod.MethodName     15
## 47                                                         CountyCode     15
## 48                                                  HUCEightDigitCode     15
## 49                            SampleCollectionMethod.MethodIdentifier     14
## 50                                  MonitoringLocationDescriptionText     14
## 51                       SampleCollectionMethod.MethodDescriptionText     13
## 52                     ResultAnalyticalMethod.MethodIdentifierContext     12
## 53                  DetectionQuantitationLimitMeasure.MeasureUnitCode     11
## 54                                      SampleCollectionEquipmentName     11
## 55                                               MethodSpeciationName     10
## 56                                          TADA.MethodSpeciationName     10
## 57                              TADA.ResultMeasureValueDataTypes.Flag      9
## 58                                               MeasureQualifierCode      9
## 59                                                     LaboratoryName      9
## 60                                           ResultSampleFractionText      8
## 61                                      TADA.ResultSampleFractionText      8
## 62             TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode      8
## 63                                                   ActivityTypeCode      7
## 64                                       ResultDetectionConditionText      7
## 65                     SampleCollectionMethod.MethodIdentifierContext      7
## 66                                 DetectionQuantitationLimitTypeName      6
## 67                                             OrganizationIdentifier      6
## 68                                             OrganizationFormalName      6
## 69                                         MonitoringLocationTypeName      5
## 70                                     ActivityStartTime.TimeZoneCode      4
## 71         TADA.ActivityDepthHeightMeasure.MeasureValueDataTypes.Flag      4
## 72                                             QAPPApprovalAgencyName      4
## 73                                                          StateCode      4
## 74                                                      timeZoneStart      4
## 75                                                  ActivityMediaName      3
## 76                                             TADA.ActivityMediaName      3
## 77                                                ResultValueTypeName      3
## 78  TADA.DetectionQuantitationLimitMeasure.MeasureValueDataTypes.Flag      3
## 79                           ResultDepthHeightMeasure.MeasureUnitCode      3
## 80                                          ActivityRelativeDepthName      3
## 81   TADA.ActivityBottomDepthHeightMeasure.MeasureValueDataTypes.Flag      3
## 82                                                StatisticalBaseCode      3
## 83                                              QAPPApprovedIndicator      3
## 84                                       ActivityEndTime.TimeZoneCode      3
## 85                                                        timeZoneEnd      3
## 86                                       ActivityMediaSubdivisionName      2
## 87                                       TADA.WQXResultUnitConversion      2
## 88           TADA.ResultDepthHeightMeasure.MeasureValueDataTypes.Flag      2
## 89                      TADA.ResultDepthHeightMeasure.MeasureUnitCode      2
## 90                         ActivityDepthHeightMeasure.MeasureUnitCode      2
## 91                    TADA.ActivityDepthHeightMeasure.MeasureUnitCode      2
## 92                         ActivityTopDepthHeightMeasure.MeasureValue      2
## 93                    TADA.ActivityTopDepthHeightMeasure.MeasureValue      2
## 94      TADA.ActivityTopDepthHeightMeasure.MeasureValueDataTypes.Flag      2
## 95                      ActivityTopDepthHeightMeasure.MeasureUnitCode      2
## 96                 TADA.ActivityTopDepthHeightMeasure.MeasureUnitCode      2
## 97                   ActivityBottomDepthHeightMeasure.MeasureUnitCode      2
## 98              TADA.ActivityBottomDepthHeightMeasure.MeasureUnitCode      2
## 99                                                        CountryCode      2
## 100                      HorizontalCoordinateReferenceSystemDatumName      2
## 101                                ActivityConductingOrganizationText      2
## 102                                            ResultStatusIdentifier      2
## 103                                             SourceMapScaleNumeric      2
## 104                                    HorizontalCollectionMethodName      2
## 105                                                      ProviderName      1

Question 3: How many unique ‘TADA.ActivityMediaName’ values exist in your dataset? Are there any media types that are not water?

TADA is currently designed to accommodate water data from the WQP. Let’s ensure that we remove all non-water data first.

# remove data with media type that is not water
removed <- dataset_0 %>%
  dplyr::filter(!TADA.ActivityMediaName %in% c("WATER")) %>%
  dplyr::mutate(TADA.RemovalReason = "Activity media is not water.")

# what other media types exist in dataset?
unique(removed$TADA.ActivityMediaName)
## [1] "BIOLOGICAL" "AIR"
# clean dataset contains only water
dataset <- dataset_0 %>% dplyr::filter(TADA.ActivityMediaName %in% c("WATER"))

dimCheck(all_result_num, dataset, removed, checkName = "Activity Media")
## [1] "Good to go. Zero results created or destroyed in Activity Media check."

Two additional helper functions one can use at any step in the process are TADA_FieldValuesTable() and TADA_FieldValuesPie(). These functions create a summary table and pie chart (respectively) of all the unique values in a given column. Let’s give it a try on OrganizationFormalName, which is a WQP column naming the organization that supplied the result.

TADA::TADA_FieldValuesPie(dataset, field = "OrganizationFormalName")

org_counts <- TADA::TADA_FieldValuesTable(dataset, field = "OrganizationFormalName")

org_counts
##                                             Value  Count
## 1                                    Red Lake DNR 104358
## 2               Fond du Lac Band of Chippewa (MN)  19764
## 3                     Sac and Fox Nation (Tribal)   9943
## 4                      Pueblo Of Tesuque (Tribal)   6798
## 5 Chickasaw Nation Environmental Service (Tribal)   6071
## 6                              Pueblo of Pojoaque   1441

Question 4: When might a user choose to view a column’s unique values as a table rather than in a pie chart?

We can take a quick look at some of the TADA-created columns that review result value types. Because TADA is intended to work with numeric data, at this point, it would be good to remove those result values that are NA without any detection limit info, or contain text or special characters that cannot be converted to numeric. Note that TADA will fill in missing values with detection limit values and units with the TADA_IDCensoredData if the ResultDetectionConditionText and DetectionQuantitationLimitType fields are populated. Use ?TADA_ConvertSpecialChars for more details on result value types and handling and ?TADA_IDCensoredData for details on censored data preparation.

First, we can run TADA_IDCensoredData to fill in as many NA/missing values as possible. We can use TADA_FieldValuesPie to view the censored data flags identified in the data set and their relative frequency. TADA_IDCensoredData sorts result values into detection limit categories (e.g. non-detect, over-detect) based on populated values in the ResultDetectionConditionText and DetectionQuantitationLimitTypeName columns.

You can find the reference tables used to make these decisions in TADA_GetDetCondRef() and TADA_GetDetLimitRef() functions. In some cases, results are missing detection limit/condition info, or there is a conflict in the detection limit and condition. The user may want to remove problematic detection limit data before proceeding. We can also filter for the “problem” data by TADA.CensoredData.Flag and review the unique reasons for data removal.

dataset <- TADA::TADA_IDCensoredData(dataset)
## [1] "TADA_IDCensoredData: There are 105 results in your dataset that are missing ResultDetectionConditionText. TADA requires BOTH ResultDetectionConditionText and DetectionQuantitationLimitTypeName fields to be populated in order to categorize censored data."
## [1] "TADA_IDCensoredData: DetectionQuantitationLimitTypeName column dataset contains value(s) NA which is/are not represented in the DetectionQuantitationLimitTypeName WQX domain table. These data records are placed under the TADA.CensoredData.Flag: Censored but not Categorized, and will not be used in censored data handling methods. Please contact TADA administrators to resolve."
TADA::TADA_FieldValuesPie(dataset, field = "TADA.CensoredData.Flag")

problem_censored <- dataset %>%
  dplyr::filter(!TADA.CensoredData.Flag %in% c("Non-Detect", "Over-Detect", "Other", "Uncensored")) %>%
  dplyr::mutate(TADA.RemovalReason = "Detection limit information contains errors or missing information.")

# Let's take a look at the problematic data that we filtered out (if any)
check <- unique(problem_censored[, c("TADA.CharacteristicName", "ResultDetectionConditionText", "DetectionQuantitationLimitTypeName", "TADA.CensoredData.Flag")])

check
##    TADA.CharacteristicName ResultDetectionConditionText
## 1             PHEOPHYTIN A                         <NA>
## 2           ORTHOPHOSPHATE                         <NA>
## 4                  NITRATE                         <NA>
## 5                  NITRITE                         <NA>
## 13       ALKALINITY, TOTAL                         <NA>
## 17                CHROMIUM                         <NA>
## 18                  COPPER                         <NA>
## 94                 SULFATE                         <NA>
##    DetectionQuantitationLimitTypeName
## 1                                <NA>
## 2              Method Detection Level
## 4              Method Detection Level
## 5              Method Detection Level
## 13             Method Detection Level
## 17             Method Detection Level
## 18             Method Detection Level
## 94             Method Detection Level
##                                               TADA.CensoredData.Flag
## 1  Detection condition is missing and required for censored data ID.
## 2  Detection condition is missing and required for censored data ID.
## 4  Detection condition is missing and required for censored data ID.
## 5  Detection condition is missing and required for censored data ID.
## 13 Detection condition is missing and required for censored data ID.
## 17 Detection condition is missing and required for censored data ID.
## 18 Detection condition is missing and required for censored data ID.
## 94 Detection condition is missing and required for censored data ID.
dataset <- dataset %>% dplyr::filter(TADA.CensoredData.Flag %in% c("Non-Detect", "Over-Detect", "Other", "Uncensored"))

# Let's take a look at the removed data
removed <- plyr::rbind.fill(removed, problem_censored)

# dimension check
dimCheck(all_result_num, dataset, removed, checkName = "Censored Data")
## [1] "Good to go. Zero results created or destroyed in Censored Data check."

Next, we can take a look at the data types present and filter out any non-allowable types.

# take a look at datatypes
flag.datatypes <- TADA::TADA_FieldValuesTable(dataset, field = "TADA.ResultMeasureValueDataTypes.Flag")

# Numeric or numeric-coerced data types
rv_datatypes <- unique(subset(dataset, !is.na(dataset$TADA.ResultMeasureValue))$TADA.ResultMeasureValueDataTypes.Flag)

# Non-numeric data types coerced to NA
na_rv_datatypes <- unique(subset(dataset, is.na(dataset$TADA.ResultMeasureValue))$TADA.ResultMeasureValueDataTypes.Flag)
# these are all of the NOT allowable data types in the dataset.
incompatible_datatype <- dataset %>%
  dplyr::filter(!dataset$TADA.ResultMeasureValueDataTypes.Flag %in% c("Numeric", "Less Than", "Greater Than", "Approximate Value", "Percentage", "Comma-Separated Numeric", "Numeric Range - Averaged", "Result Value/Unit Copied from Detection Limit")) %>%
  dplyr::mutate(TADA.RemovalReason = "Result value type cannot be converted to numeric or no detection limit values provided.")

# take a look at the difficult data types - do they make sense?
check <- unique(incompatible_datatype[, c("TADA.CharacteristicName", "ResultMeasureValue", "TADA.ResultMeasureValue", "ResultMeasure.MeasureUnitCode", "TADA.ResultMeasure.MeasureUnitCode", "TADA.ResultMeasureValueDataTypes.Flag", "DetectionQuantitationLimitMeasure.MeasureValue", "TADA.DetectionQuantitationLimitMeasure.MeasureValue", "DetectionQuantitationLimitMeasure.MeasureUnitCode", "TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode")])

check
##                          TADA.CharacteristicName
## 1                                    CLOUD COVER
## 4                                  WIND VELOCITY
## 5                                    CLOUD COVER
## 15                                   CLOUD COVER
## 16                     LOONS, VISUAL OBSERVATION
## 22                                  HEIGHT, GAGE
## 32                       WATER APPEARANCE (TEXT)
## 34                       WATER APPEARANCE (TEXT)
## 56                       WATER APPEARANCE (TEXT)
## 57                       WATER APPEARANCE (TEXT)
## 59                       WATER APPEARANCE (TEXT)
## 61                       WATER APPEARANCE (TEXT)
## 64                       WATER APPEARANCE (TEXT)
## 70                       WATER APPEARANCE (TEXT)
## 86                       WATER APPEARANCE (TEXT)
## 90                       WATER APPEARANCE (TEXT)
## 107                      WATER APPEARANCE (TEXT)
## 112                      WATER APPEARANCE (TEXT)
## 115                      WATER APPEARANCE (TEXT)
## 152                      WATER APPEARANCE (TEXT)
## 184                      WATER APPEARANCE (TEXT)
## 186                      WATER APPEARANCE (TEXT)
## 236                      WATER APPEARANCE (TEXT)
## 254                      WATER APPEARANCE (TEXT)
## 300                      WATER APPEARANCE (TEXT)
## 314                      WATER APPEARANCE (TEXT)
## 389                      WATER APPEARANCE (TEXT)
## 446                      WATER APPEARANCE (TEXT)
## 512                      WATER APPEARANCE (TEXT)
## 522                      WATER APPEARANCE (TEXT)
## 528                      WATER APPEARANCE (TEXT)
## 598                      WATER APPEARANCE (TEXT)
## 618                      WATER APPEARANCE (TEXT)
## 620                      WATER APPEARANCE (TEXT)
## 692                      WATER APPEARANCE (TEXT)
## 824                      WATER APPEARANCE (TEXT)
## 837                      WATER APPEARANCE (TEXT)
## 984                      WATER APPEARANCE (TEXT)
## 1123                                        FLOW
## 1124                         BAROMETRIC PRESSURE
## 1125                         BAROMETRIC PRESSURE
## 1133                     WATER APPEARANCE (TEXT)
## 1333                     WATER APPEARANCE (TEXT)
## 1523                     WATER APPEARANCE (TEXT)
## 1635                          TEMPERATURE, WATER
## 1636                 DISSOLVED OXYGEN SATURATION
## 1637                                CONDUCTIVITY
## 1638                      TOTAL DISSOLVED SOLIDS
## 1663                     WATER APPEARANCE (TEXT)
## 1939                     WATER APPEARANCE (TEXT)
## 2047                     WATER APPEARANCE (TEXT)
## 2066                     WATER APPEARANCE (TEXT)
## 2268                     WATER APPEARANCE (TEXT)
## 2353 LAKE RECREATIONAL SUITABILITY (CHOICE LIST)
## 2393 LAKE RECREATIONAL SUITABILITY (CHOICE LIST)
## 2401 LAKE RECREATIONAL SUITABILITY (CHOICE LIST)
## 2447                     WATER APPEARANCE (TEXT)
## 3277                     WATER APPEARANCE (TEXT)
## 3283                     WATER APPEARANCE (TEXT)
## 3327 LAKE RECREATIONAL SUITABILITY (CHOICE LIST)
##                                         ResultMeasureValue
## 1                                                    foggy
## 4                                                     Calm
## 5                                              clear_sunny
## 15                                                 raining
## 16                                                    <NA>
## 22                                                    1:78
## 32                                    Clear; Crystal Clear
## 34                                    Stain; Crystal Clear
## 56                            Sediment; some algae present
## 57                               Stain; some algae present
## 59                                                Sediment
## 61                               Stain; Some algae present
## 64                                                   Stain
## 70                               Green; some algae present
## 86                               Green; Some algae present
## 90                                  Stained; Crystal Clear
## 107                                   Green; crystal clear
## 112                           Sediment; Some algae present
## 115   Clear water color; crystal clear physical conditions
## 152                              Clear; some algae present
## 184                                   Stain; crystal clear
## 186                          Green; definite algae present
## 236                                                  Clear
## 254                                                  Green
## 300                                            Tea-colored
## 314                              Clear; Some algae present
## 389                                   Stain; Crystal clear
## 446                                                 Cloudy
## 512    Water color clear; physical condition crystal clear
## 522  Water color stained; physical condition crystal clear
## 528                          Stain; Definite algae present
## 598                                   Green; Crystal Clear
## 618                                Sediment; Crystal clear
## 620                                Sediment; Crystal Clear
## 692                          Green; Definite algae present
## 824                      Sediment; Definitie algae present
## 837                                   Green; Crystal clear
## 984                                                Stained
## 1123                                                  <NA>
## 1124                                         600.7061 mmHG
## 1125                                           600.71 mmHG
## 1133                                  Clear; Crystal clear
## 1333                      Sediment; Definite algae present
## 1523                         Stain; definite algae present
## 1635                                                  None
## 1636                                                  mg/l
## 1637                                                  mg/l
## 1638                                                   PSS
## 1663                                  Clear; crystal clear
## 1939                         Clear; Definite algae present
## 2047                 Sediment; Stage tape-down from bridge
## 2066                      Sediment; definite algae present
## 2268                                                 Muddy
## 2353                                           1.VERY GOOD
## 2393                                                2.GOOD
## 2401                                                3.FAIR
## 2447                               Green; High algal color
## 3277                                       Sediment; Clear
## 3283                                          Green; Clear
## 3327                                                4.POOR
##      TADA.ResultMeasureValue ResultMeasure.MeasureUnitCode
## 1                         NA                             %
## 4                         NA                           mph
## 5                         NA                             %
## 15                        NA                             %
## 16                        NA                          <NA>
## 22                        NA                            ft
## 32                        NA                          <NA>
## 34                        NA                          <NA>
## 56                        NA                          <NA>
## 57                        NA                          <NA>
## 59                        NA                          <NA>
## 61                        NA                          <NA>
## 64                        NA                          <NA>
## 70                        NA                          <NA>
## 86                        NA                          <NA>
## 90                        NA                          <NA>
## 107                       NA                          <NA>
## 112                       NA                          <NA>
## 115                       NA                          <NA>
## 152                       NA                          <NA>
## 184                       NA                          <NA>
## 186                       NA                          <NA>
## 236                       NA                          <NA>
## 254                       NA                          <NA>
## 300                       NA                          <NA>
## 314                       NA                          <NA>
## 389                       NA                          <NA>
## 446                       NA                          <NA>
## 512                       NA                          <NA>
## 522                       NA                          <NA>
## 528                       NA                          <NA>
## 598                       NA                          <NA>
## 618                       NA                          <NA>
## 620                       NA                          <NA>
## 692                       NA                          <NA>
## 824                       NA                          <NA>
## 837                       NA                          <NA>
## 984                       NA                          <NA>
## 1123                      NA                           cfs
## 1124                      NA                           psi
## 1125                      NA                           psi
## 1133                      NA                          <NA>
## 1333                      NA                          <NA>
## 1523                      NA                          <NA>
## 1635                      NA                         deg C
## 1636                      NA                             %
## 1637                      NA                         uS/cm
## 1638                      NA                          mg/L
## 1663                      NA                          <NA>
## 1939                      NA                          <NA>
## 2047                      NA                          <NA>
## 2066                      NA                          <NA>
## 2268                      NA                          <NA>
## 2353                      NA                          <NA>
## 2393                      NA                          <NA>
## 2401                      NA                          <NA>
## 2447                      NA                          <NA>
## 3277                      NA                          <NA>
## 3283                      NA                          <NA>
## 3327                      NA                          <NA>
##      TADA.ResultMeasure.MeasureUnitCode TADA.ResultMeasureValueDataTypes.Flag
## 1                                     %                                  Text
## 4                                 M/SEC                                  Text
## 5                                     %                                  Text
## 15                                    %                                  Text
## 16                                 <NA>                    NA - Not Available
## 22                                   IN                         Coerced to NA
## 32                                 <NA>                                  Text
## 34                                 <NA>                                  Text
## 56                                 <NA>                                  Text
## 57                                 <NA>                                  Text
## 59                                 <NA>                                  Text
## 61                                 <NA>                                  Text
## 64                                 <NA>                                  Text
## 70                                 <NA>                                  Text
## 86                                 <NA>                                  Text
## 90                                 <NA>                                  Text
## 107                                <NA>                                  Text
## 112                                <NA>                                  Text
## 115                                <NA>                                  Text
## 152                                <NA>                                  Text
## 184                                <NA>                                  Text
## 186                                <NA>                                  Text
## 236                                <NA>                                  Text
## 254                                <NA>                                  Text
## 300                                <NA>                                  Text
## 314                                <NA>                                  Text
## 389                                <NA>                                  Text
## 446                                <NA>                                  Text
## 512                                <NA>                                  Text
## 522                                <NA>                                  Text
## 528                                <NA>                                  Text
## 598                                <NA>                                  Text
## 618                                <NA>                                  Text
## 620                                <NA>                                  Text
## 692                                <NA>                                  Text
## 824                                <NA>                                  Text
## 837                                <NA>                                  Text
## 984                                <NA>                                  Text
## 1123                                CFS                    NA - Not Available
## 1124                               G/M2                                  Text
## 1125                               G/M2                                  Text
## 1133                               <NA>                                  Text
## 1333                               <NA>                                  Text
## 1523                               <NA>                                  Text
## 1635                              DEG C                                  Text
## 1636                                  %                                  Text
## 1637                              US/CM                                  Text
## 1638                               UG/L                                  Text
## 1663                               <NA>                                  Text
## 1939                               <NA>                                  Text
## 2047                               <NA>                                  Text
## 2066                               <NA>                                  Text
## 2268                               <NA>                                  Text
## 2353                               <NA>                                  Text
## 2393                               <NA>                                  Text
## 2401                               <NA>                                  Text
## 2447                               <NA>                                  Text
## 3277                               <NA>                                  Text
## 3283                               <NA>                                  Text
## 3327                               <NA>                                  Text
##      DetectionQuantitationLimitMeasure.MeasureValue
## 1                                              <NA>
## 4                                              <NA>
## 5                                              <NA>
## 15                                             <NA>
## 16                                             <NA>
## 22                                             <NA>
## 32                                             <NA>
## 34                                             <NA>
## 56                                             <NA>
## 57                                             <NA>
## 59                                             <NA>
## 61                                             <NA>
## 64                                             <NA>
## 70                                             <NA>
## 86                                             <NA>
## 90                                             <NA>
## 107                                            <NA>
## 112                                            <NA>
## 115                                            <NA>
## 152                                            <NA>
## 184                                            <NA>
## 186                                            <NA>
## 236                                            <NA>
## 254                                            <NA>
## 300                                            <NA>
## 314                                            <NA>
## 389                                            <NA>
## 446                                            <NA>
## 512                                            <NA>
## 522                                            <NA>
## 528                                            <NA>
## 598                                            <NA>
## 618                                            <NA>
## 620                                            <NA>
## 692                                            <NA>
## 824                                            <NA>
## 837                                            <NA>
## 984                                            <NA>
## 1123                                           <NA>
## 1124                                           <NA>
## 1125                                           <NA>
## 1133                                           <NA>
## 1333                                           <NA>
## 1523                                           <NA>
## 1635                                           <NA>
## 1636                                           <NA>
## 1637                                           <NA>
## 1638                                           <NA>
## 1663                                           <NA>
## 1939                                           <NA>
## 2047                                           <NA>
## 2066                                           <NA>
## 2268                                           <NA>
## 2353                                           <NA>
## 2393                                           <NA>
## 2401                                           <NA>
## 2447                                           <NA>
## 3277                                           <NA>
## 3283                                           <NA>
## 3327                                           <NA>
##      TADA.DetectionQuantitationLimitMeasure.MeasureValue
## 1                                                     NA
## 4                                                     NA
## 5                                                     NA
## 15                                                    NA
## 16                                                    NA
## 22                                                    NA
## 32                                                    NA
## 34                                                    NA
## 56                                                    NA
## 57                                                    NA
## 59                                                    NA
## 61                                                    NA
## 64                                                    NA
## 70                                                    NA
## 86                                                    NA
## 90                                                    NA
## 107                                                   NA
## 112                                                   NA
## 115                                                   NA
## 152                                                   NA
## 184                                                   NA
## 186                                                   NA
## 236                                                   NA
## 254                                                   NA
## 300                                                   NA
## 314                                                   NA
## 389                                                   NA
## 446                                                   NA
## 512                                                   NA
## 522                                                   NA
## 528                                                   NA
## 598                                                   NA
## 618                                                   NA
## 620                                                   NA
## 692                                                   NA
## 824                                                   NA
## 837                                                   NA
## 984                                                   NA
## 1123                                                  NA
## 1124                                                  NA
## 1125                                                  NA
## 1133                                                  NA
## 1333                                                  NA
## 1523                                                  NA
## 1635                                                  NA
## 1636                                                  NA
## 1637                                                  NA
## 1638                                                  NA
## 1663                                                  NA
## 1939                                                  NA
## 2047                                                  NA
## 2066                                                  NA
## 2268                                                  NA
## 2353                                                  NA
## 2393                                                  NA
## 2401                                                  NA
## 2447                                                  NA
## 3277                                                  NA
## 3283                                                  NA
## 3327                                                  NA
##      DetectionQuantitationLimitMeasure.MeasureUnitCode
## 1                                                 <NA>
## 4                                                 <NA>
## 5                                                 <NA>
## 15                                                <NA>
## 16                                                <NA>
## 22                                                <NA>
## 32                                                <NA>
## 34                                                <NA>
## 56                                                <NA>
## 57                                                <NA>
## 59                                                <NA>
## 61                                                <NA>
## 64                                                <NA>
## 70                                                <NA>
## 86                                                <NA>
## 90                                                <NA>
## 107                                               <NA>
## 112                                               <NA>
## 115                                               <NA>
## 152                                               <NA>
## 184                                               <NA>
## 186                                               <NA>
## 236                                               <NA>
## 254                                               <NA>
## 300                                               <NA>
## 314                                               <NA>
## 389                                               <NA>
## 446                                               <NA>
## 512                                               <NA>
## 522                                               <NA>
## 528                                               <NA>
## 598                                               <NA>
## 618                                               <NA>
## 620                                               <NA>
## 692                                               <NA>
## 824                                               <NA>
## 837                                               <NA>
## 984                                               <NA>
## 1123                                              <NA>
## 1124                                              <NA>
## 1125                                              <NA>
## 1133                                              <NA>
## 1333                                              <NA>
## 1523                                              <NA>
## 1635                                              <NA>
## 1636                                              <NA>
## 1637                                              <NA>
## 1638                                              <NA>
## 1663                                              <NA>
## 1939                                              <NA>
## 2047                                              <NA>
## 2066                                              <NA>
## 2268                                              <NA>
## 2353                                              <NA>
## 2393                                              <NA>
## 2401                                              <NA>
## 2447                                              <NA>
## 3277                                              <NA>
## 3283                                              <NA>
## 3327                                              <NA>
##      TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode
## 1                                                      <NA>
## 4                                                      <NA>
## 5                                                      <NA>
## 15                                                     <NA>
## 16                                                     <NA>
## 22                                                     <NA>
## 32                                                     <NA>
## 34                                                     <NA>
## 56                                                     <NA>
## 57                                                     <NA>
## 59                                                     <NA>
## 61                                                     <NA>
## 64                                                     <NA>
## 70                                                     <NA>
## 86                                                     <NA>
## 90                                                     <NA>
## 107                                                    <NA>
## 112                                                    <NA>
## 115                                                    <NA>
## 152                                                    <NA>
## 184                                                    <NA>
## 186                                                    <NA>
## 236                                                    <NA>
## 254                                                    <NA>
## 300                                                    <NA>
## 314                                                    <NA>
## 389                                                    <NA>
## 446                                                    <NA>
## 512                                                    <NA>
## 522                                                    <NA>
## 528                                                    <NA>
## 598                                                    <NA>
## 618                                                    <NA>
## 620                                                    <NA>
## 692                                                    <NA>
## 824                                                    <NA>
## 837                                                    <NA>
## 984                                                    <NA>
## 1123                                                   <NA>
## 1124                                                   <NA>
## 1125                                                   <NA>
## 1133                                                   <NA>
## 1333                                                   <NA>
## 1523                                                   <NA>
## 1635                                                   <NA>
## 1636                                                   <NA>
## 1637                                                   <NA>
## 1638                                                   <NA>
## 1663                                                   <NA>
## 1939                                                   <NA>
## 2047                                                   <NA>
## 2066                                                   <NA>
## 2268                                                   <NA>
## 2353                                                   <NA>
## 2393                                                   <NA>
## 2401                                                   <NA>
## 2447                                                   <NA>
## 3277                                                   <NA>
## 3283                                                   <NA>
## 3327                                                   <NA>

Then we can take a closer look at the removed results and run another dimension check on the data set.

# filter data set to include allowable data types
dataset <- dataset %>% dplyr::filter(dataset$TADA.ResultMeasureValueDataTypes.Flag %in% c("Numeric", "Less Than", "Greater Than", "Approximate Value", "Percentage", "Comma-Separated Numeric", "Numeric Range - Averaged", "Result Value/Unit Copied from Detection Limit"))

# create data frame to includ all removed results
removed <- plyr::rbind.fill(removed, incompatible_datatype)

# dimension check 
dimCheck(all_result_num, dataset, removed, checkName = "Result Format")
## [1] "Good to go. Zero results created or destroyed in Result Format check."

Data flagging

We’ve taken a quick look at the raw dataset and split off some data that are not compatible with TADA, now let’s run through some quality control checks. The most important ones to run to ensure your dataset is ready for subsequent steps are TADA_FlagFraction(), TADA_FlagSpeciation(), TADA_FlagResultUnit(), and TADA_FindQCActivities(). With the exception of TADA_FindQCActivities(), these flagging functions leverage WQX’s QAQC Validation Table. TADA_FindQCActivities() uses a TADA-specific domain table users can review with TADA_GetActivityTypeRef(). All QAQC tables are frequently updated in the package to ensure they match the latest version on the web. You can find guidance for using the WQX QAQC Validation Tables in this vignette on the TADA GitHub site.

Bring the QAQC Validation Table into your R session to view or save with the following command:

qaqc_ref <- TADA::TADA_GetWQXCharValRef()

unique(qaqc_ref$Type)
## [1] "CharacteristicFraction"   "CharacteristicMethod"    
## [3] "CharacteristicSpeciation" "CharacteristicUnit"

Question 5: What do you think the qaqc_ref$Type column indicates?

TADA joins this validation table to the data and uses the “Valid” and “Invalid” labels in the Status column to create easily understandable flagging columns for each function. Let’s run these four flagging functions.

dataset_flags <- TADA::TADA_FlagFraction(dataset, clean = FALSE, flaggedonly = FALSE)
## [1] "Rows with invalid sample fractions have been flagged but retained. Review these rows before proceeding and/or set clean = TRUE."
dataset_flags <- TADA::TADA_FlagSpeciation(dataset_flags, clean = "none", flaggedonly = FALSE)
## [1] "Rows with invalid speciations have been flagged but retained. Review these rows before proceeding and/or set clean = 'invalid_only' or 'both'."
dataset_flags <- TADA::TADA_FlagResultUnit(dataset_flags, clean = "none", flaggedonly = FALSE)
## [1] "Rows with invalid result value units have been flagged but retained. Review these rows before proceeding and/or set clean = 'invalid_only' or 'both'."
dataset_flags <- TADA::TADA_FindQCActivities(dataset_flags, clean = FALSE, flaggedonly = FALSE)

dimCheck(all_result_num, dataset_flags, removed, checkName = "Run Flag Functions")
## [1] "Good to go. Zero results created or destroyed in Run Flag Functions check."

Question 6: Did any warnings or messages appear in the console after running these flagging functions? What do they say?

Now that we’ve run all the key flagging functions, let’s take a look at the results and make some decisions.

TADA::TADA_FieldValuesPie(dataset_flags, field = "TADA.SampleFraction.Flag")

TADA::TADA_FieldValuesPie(dataset_flags, field = "TADA.MethodSpeciation.Flag")

TADA::TADA_FieldValuesPie(dataset_flags, field = "TADA.ResultUnit.Flag")

TADA::TADA_FieldValuesPie(dataset_flags, field = "TADA.ActivityType.Flag")

Any results flagged as “Invalid” are recognized in the QAQC Validation Table as having some data quality issue. “NonStandardized” means that the format has not been fully vetted or processed, while “Valid” confirms that the characteristic combination is widely recognized as correctly formatted. Let’s add any invalid results to the removed dataset for later review.

Note: if you find any errors in the QAQC Validation Table, please contact the WQX Help Desk at WQX@epa.gov to help correct it. Thanks in advance!

# grab all the flagged results from the four functions
problem_flagged <- dataset_flags %>%
  filter(TADA.SampleFraction.Flag == "Invalid" | TADA.MethodSpeciation.Flag == "Invalid" | TADA.ResultUnit.Flag == "Invalid" | !TADA.ActivityType.Flag %in% ("Non_QC")) %>%
  dplyr::mutate(TADA.RemovalReason = "Invalid Unit, Method, Speciation, or Activity Type.")

dataset_flags <- dataset_flags %>% dplyr::filter(!ResultIdentifier %in% problem_flagged$ResultIdentifier)

# create data frame of removed results
removed <- plyr::rbind.fill(removed, problem_flagged)

# remove df no longer needed
rm(problem_flagged)

# dimension check
dimCheck(all_result_num, dataset_flags, removed, checkName = "Filter Flag Functions")
## [1] "Good to go. Zero results created or destroyed in Filter Flag Functions check."

Question 7: Are there any other metadata columns that you review and filter in your workflow?

We’ve finished running the recommended flagging functions and removing results that do not pass QC checks. Let’s look at the breakdown of these data in the removed object.

removal <- TADA::TADA_FieldValuesTable(removed, field = "TADA.RemovalReason")

removal
##                                                                                     Value
## 1                                                            Activity media is not water.
## 2 Result value type cannot be converted to numeric or no detection limit values provided.
## 3                                     Invalid Unit, Method, Speciation, or Activity Type.
## 4                     Detection limit information contains errors or missing information.
##   Count
## 1  4679
## 2  3977
## 3  1776
## 4   105

You can review any other columns of interest and create custom domain tables of your “Valid” and “Invalid” criteria using R or Excel. Also check out some of the other flagging functions available in TADA:

  • ?TADA_FindNearbySites() - under active development

  • ?TADA_FindPotentialDuplicatesMultipleOrgs()

  • ?TADA_FindPotentialDuplicatesSingleOrg()

  • ?TADA_FindQAPPApproval()

  • ?TADA_FindQAPPDoc()

  • ?TADA_FlagAboveThreshold()

  • ?TADA_FlagBelowThreshold()

  • ?TADA_FlagContinuousData()

  • ?TADA_FlagCoordinates()

  • ?TADA_FlagMeasureQualifierCode()

  • ?TADA_FlagMethod()

Please let us know of other flagging functions you think would have broad appeal in the TADA package or need assistance brainstorming/developing.

Censored data handling

We have already identified, flagged, and in some cases removed problematic detection limit data from our dataset, but that doesn’t keep them from being difficult. Because we do not know the result value with adequate precision, water quality data users often set non-detect values to some number below the reported detection limit. TADA contains some simple methods for handling detection limits: users may multiply the detection limit by some number between 0 and 1, or convert the detection limit value to a random number between 0 and the detection limit. More complex detection limit estimation requiring regression models (Maximum Likelihood, Kaplan-Meier, Robust Regression on Order Statistics) or similar must be performed outside of the current version of TADA (though future development is planned).

Question 8: How would you parameterize TADA_SimpleCensoredMethods() to make non-detect values equal to the provided detection limit? What would you need to change in the example below?

dataset_cens <- TADA::TADA_SimpleCensoredMethods(dataset_flags, nd_method = "multiplier", nd_multiplier = 0.5, od_method = "as-is")

Let’s take a look at how the censored data handling function affects the TADA.ResultMeasureValueDataTypes.Flag column.

First, we can look use TADA_FieldValuesTable to look at the TADA.ResultMeasureValueDataTypes.Flag column in data set before we ran TADA_SimpleCensoredMethods.

# before
TADA::TADA_FieldValuesTable(dataset_flags, field = "TADA.ResultMeasureValueDataTypes.Flag")
##                                           Value  Count
## 1                                       Numeric 137517
## 2 Result Value/Unit Copied from Detection Limit   4031
## 3                                    Percentage    913
## 4                      Numeric Range - Averaged     33
## 5                                     Less Than     19
## 6                                  Greater Than      3
## 7                       Comma-Separated Numeric      1

Then we can use TADA_FieldValuesTable again to look at the same column after TADA_SimpleCensoredMethods.

# after
TADA::TADA_FieldValuesTable(dataset_cens, field = "TADA.ResultMeasureValueDataTypes.Flag")
##                                              Value  Count
## 1                                          Numeric 137498
## 2 Result Value/Unit Estimated from Detection Limit   4045
## 3                                       Percentage    913
## 4                         Numeric Range - Averaged     33
## 5                                        Less Than     19
## 6    Result Value/Unit Copied from Detection Limit      5
## 7                                     Greater Than      3
## 8                          Comma-Separated Numeric      1

Question 9: Is there a difference between the first and second tables?

If you’d like to start thinking about using statistical methods to estimate detection limit values, check out the ?TADA_Stats function, which accepts user-defined data groupings (or defaults to TADA.ComparableDataIdentifier to determine measurement count, location count, censored data stats, min, max, and percentile stats, and suggests non-detect estimatiom methods based on the number of results, % of dataset censored, and number of censoring levels (detection limits). The decision tree used in the function was outlined in a National Nonpoint Source Tech Memo.

Data exploration

How are you feeling about your test dataset? Does it seem ready for the next step(s) in your analyses? There’s probably a lot more you’d like to look at/filter out before you’re ready to say: QC complete. Let’s first check out characteristics in the dataset using dplyr functions and pipes.

# get table of characteristics with number of results, sites, and organizations
dataset_cens_summary <- dataset_cens %>%
  dplyr::group_by(TADA.CharacteristicName) %>%
  dplyr::summarise(Result_Count = length(ResultIdentifier), Site_Count = length(unique(MonitoringLocationIdentifier)), Org_Count = length(unique(OrganizationIdentifier))) %>%
  dplyr::arrange(desc(Result_Count)) 

You may see a characteristic that you’d like to investigate further in isolation. TADA_FieldValuesPie() will also produce summary pie charts for a given column within a specific characteristic. Let’s take a look.

# go ahead and pick a characteristic name from the table generated above. I picked dissolved oxygen (DO) amd selected OrganizationFormalName as the field to see the relative contribution of each org to DO results
TADA::TADA_FieldValuesPie(dataset_cens, field = "OrganizationFormalName", characteristicName = "DISSOLVED OXYGEN (DO)")

We can view the site locations using a TADA mapping function. In this map, the circles indicate monitoring locations in the dataset; their size corresponds to the number of results collected at that site, while the darker the circle, the more characteristics were sampled at that site.

TADA::TADA_OverviewMap(dataset_cens)

Out of curiosity, let’s take a look at a breakdown of these monitoring location types. Do they all indicate surface water samples? Depending upon your program’s goals and methods, you might want to filter out some of the types you see.

TADA::TADA_FieldValuesPie(dataset_cens, field = "MonitoringLocationTypeName")

One of the next big steps is data harmonization: translating and aggregating synonyms, combining multiple forms/species of certain characteristics, etc. We won’t get to that in this demo (more details can be found here: TADA Module 1: Water Quality Portal Data Discovery and Cleaning or TADA_HarmonizeSynonyms()), but for now we can start looking at data distributions within a single characteristic-speciation-fraction-unit using the plotting functions TADA_Histogram() and TADA_Boxplot(). We can also view a stats table using TADA_Stats.

Let’s first take a look at the column TADA.ComparableDataIdentifier, which breaks down characteristics into groups by name, fraction, speciation, and unit. These four columns are important to evaluate together to ensure the plotted data are sufficiently similar to appear on a single plot together: it doesn’t make sense to plot characteristics with different units or fractions in the same distribution.

# trusty field values table - lets just look at the first few entries with the most associated records
compid <- TADA::TADA_FieldValuesTable(dataset_cens, field = "TADA.ComparableDataIdentifier")

Now that we have an idea for what the TADA.ComparableDataIdentifier looks like, we can check out how it is used to plot distinct characteristic groups.

# Look at a histogram, boxplot, and stats for TADA.ComparableDataIdentifier(s) of your choice.
comp_data_id <- "PH_NA_NA_NONE"

plot_data <- subset(dataset_cens, dataset_cens$TADA.ComparableDataIdentifier %in% comp_data_id)

Question 10: How does selecting the different options on the left side of the histogram change the data displayed? When might you want to use a histogram vs. a boxplot?

Let’s take a look at the histogram and boxplot for the comparable data identifier we selected.

TADA::TADA_Histogram(plot_data, id_cols = "TADA.ComparableDataIdentifier")
## [1] "Plotting function removed 7 results where TADA.ResultMeasureValue = NA. These results cannot be plotted."
TADA::TADA_Boxplot(plot_data, id_cols = "TADA.ComparableDataIdentifier")
## [1] "Plotting function removed 7 results where TADA.ResultMeasureValue = NA. These results cannot be plotted."
stats <- TADA::TADA_Stats(plot_data)
## [1] "Dataset contains 7 results missing both a TADA result value and a detection limit. These values will not be represented in the stats summary table. Suggest removing or handling."

We can also explore depth profiles for selected characteristics at specific site on a single date. There are a few functions that can help with this. First we can use TADA_FlagDepthCategory to place results into various depth categories (surface, middle, and bottom).

# 
dataset_depth <- TADA::TADA_FlagDepthCategory(dataset_cens)
## [1] "TADA_FlagDepthCategory: checking data set for depth values. 70304 results have depth values available."
## [1] "TADA_FlagDepthCategory: assigning depth categories."
## [1] "TADA_FlagDepthCategory: Grouping results by MonitoringLocationIdentifier, OrganizationIdentifier, CharacteristicName, and ActivityStartDate for aggregation for entire water column."
## [1] "TADA_FlagDepthCategory: No aggregation performed."

We can also use another function, TADA_IDDepthProfiles to identify location/date/characteristic combinations in the data set that can be used for depth profile plots or analysis. The default number of values required to identify a location/date/characteristic as a depth profile is 2, but this can be changed by the user. We will specify a larger value, 5, so that any depth profiles identified will have results from at least 5 different depths.

# 
depth_profile_id <- TADA::TADA_IDDepthProfiles(dataset_depth, nvalue = 5)

Question 11: How can TADA_IDDepthProfiles() help users use TADA_DepthProfilePlot most efficiently?

Now, we can use TADA_DepthProfilePlot to plot up to three characteristics against depth. In this example, we will look at pH, secchi depth, and pH.

# 
 TADA::TADA_DepthProfilePlot(dataset_cens, 
                        groups = c('TEMPERATURE, WATER_NA_NA_DEG C', 'DEPTH, SECCHI DISK DEPTH_NA_NA_M', 'PH_NA_NA_NONE'),
                        location = "REDLAKE_WQX-ANKE",
                       activity_date = "2018-10-04",
                       depthcat = TRUE,
                       surfacevalue = 2,
                       bottomvalue = 2,
                       unit = "m")
## [1] "TADA_DepthProfilePlot: Running TADA_DepthCategoryFlag function to add required columns to data frame"
## [1] "TADA_FlagDepthCategory: checking data set for depth values. 70304 results have depth values available."
## [1] "TADA_FlagDepthCategory: assigning depth categories."
## [1] "TADA_FlagDepthCategory: Grouping results by MonitoringLocationIdentifier, OrganizationIdentifier, CharacteristicName, and ActivityStartDate for aggregation for entire water column."
## [1] "TADA_FlagDepthCategory: No aggregation performed."
## [1] "TADA_DepthProfilePlot: Depth unit in data set matches depth unit specified by user for plot. No conversion necessary."
## [1] "TADA_DepthProfilePlot: Identifying available depth profile data."
## [1] "TADA_DepthProfilePlot: Any results for DEPTH, SECCHI DISK DEPTH, DEPTH, SECCHI DISK DEPTH (CHOICE LIST), DEPTH, SECCHI DISK DEPTH REAPPEARS, DEPTH, DATA-LOGGER (NON-PORTED), DEPTH, DATA-LOGGER (PORTED), RBP STREAM DEPTH - RIFFLE, RBP STREAM DEPTH - RUN, THALWEG DEPTH match the depth unit selected for the figure."
## [1] "TADA_DepthProfilePlot: Adding surface delination to figure."
## [1] "TADA_DepthProfilePlot: Adding bottom delination to figure."

Finally, we can download our PASS and FAIL data sets together into an Excel spreadsheet.

dataset_and_removed <- dplyr::bind_rows(dataset_cens, removed)

# Un-comment to download Excel spreadsheet to your working directory
# install.packages(writexl)
# library(writexl)
# writexl::write_xlsx(dataset_and_removed, "NCTCShepherdstownData.xlsx")

TADA R Shiny Modules

Finally, take a look at an alternative workflow for QC’ing WQP data: TADA Shiny Module 1: Data Discovery and Cleaning. This is a Shiny application that runs many of the TADA functions covered in this training document behind a graphical user interface. The shiny application queries the WQP, contains maps and data visualizations, flags suspect data results, handles censored data, and more. You can launch it using the code below.

# download TADA Shiny repository
remotes::install_github("USEPA/TADAShiny", ref = "develop", dependencies = TRUE)

# launch the app locally.
TADAShiny::run_app()

DRAFT Module 1 is also currently hosted on the web with minimal server memory/storage allocated.